ABSTRACT

The purpose of this study was to determine the extent and 
magnitude of differential item functioning (DIF) between males and females in 
existing attitudinal data sets. The focus was on the approximate proportion 
of items that show statistically significant DIF in selected data sets 
concerning attitude scales, the magnitude of this DIF, and whether the items 
more often favor males or females. Two methods for detecting DIF, the 
Mantel-Haenszel (MH) procedure and logistic regression (LR), were used. While 
more than 70 data sets were examined, only 23 met the standards for inclusion in 
this study. These 23 data sets contained 54 acceptable scales with a total of 
42,730 subjects responding to 982 items. Results suggest that these scales 
may have had more than a few items functioning differentially by gender. 
Generally DIF, with respect to gender, appeared to be reasonably balanced 
between items favoring males and those favoring females. However, it cannot 
be concluded that the combined influence of the items (effect size) was 
impartial. The magnitude of the DIF was not trivial, being in the range of 
medium to large. Results also show that both methods of DIF detection yielded 
very similar results with respect to uniform DIF. Conditions favoring one 
approach over the other are discussed. An appendix contains brief 
descriptions of the studies considered. (Contains 1 table and 17 references.) 
The Prevalence of Gender DIF in Survey Data 

In cognitive assessments, Differential Item Functioning (DIF) occurs when individuals 
from different subgroups, but at the same overall level of skill or ability, have different success 
in responding to an item (Hambleton, Swaminathan, & Rogers, 1991). Similarly, an item in a 
survey or attitude scale shows DIF if individuals from different subgroups, but with the same 
overall attitude, do not have the same probability of responding positively to an item (Hulin, 
Drasgow, & Parsons, 1983). DIF with attitude data has been discussed previously (e.g., Benson, 
1987; Brown, 1996; Dancer, Anderson, & Derlin, 1991; Johanson, 1997; Marie, 1997). The 
purpose of the current study is to determine the extent and magnitude of DIF between males and 
females in existing attitudinal data sets. In particular, we determine the approximate proportion 
of items that show statistically significant gender DIF in selected data sets containing attitude 
scales, the magnitude of this DIF, and whether the items more often favor males or females. 

Depending on the reason for the differential functioning and the purpose of the attitude 
measure, a subset of differentially functioning items can be expected to be biased (Camilli & 
Shepard, 1994). A good deal of effort is expended in many cognitive assessments to identify DIF 
and revise or eliminate biased items. If DIF is widespread in surveys and attitude scales, then a 
similar level of effort may be warranted to identify such items and reduce inappropriate 
assessment. 

Items that function differentially can be identified for any distinct groups of subjects. 
Gender was chosen for this investigation for two reasons. First, gender is one of the 
more commonly used variables for DIF analysis in cognitive assessments. Gender differences are 
often of interest to researchers, evaluators, and policy makers and gender-biased items can alter 
these differences. Camilli and Shepard (1994) point out that "Test bias is most often an issue in the 
study [of] racial and ethnic group differences and gender differences" (p. 8). The second reason for 
studying gender differences was more practical: information regarding gender is frequently 
included in existing data sets. 

Method 

Two methods for detecting DIF, the Mantel-Haenszel procedure (MH) and logistic 
regression (LR), were used in this study. The MH technique (Mantel & Haenszel, 1958) is 
commonly used to detect uniform DIF where there is no interaction between gender and the 
overall attitude. LR (Swaminathan & Rogers, 1990) is particularly effective in detecting 
non-uniform DIF. It was expected that the agreement between methods would be good where the DIF 
was uniform (Rogers & Swaminathan, 1993). Both methods provide a statistical test of 
significant DIF. 

Estimates of effect size or magnitude of the DIF are an important adjunct to null 
hypothesis testing and are reported in this study using the odds ratio associated with the MH 
procedure. Effect size can be understood as “the degree to which the phenomenon is present in 
the population” (Cohen, 1988, p. 9). Educational Testing Service defines a large effect for DIF in 
evaluating cognitive items to be one in which the statistical test is significant and where the 
absolute value of 2.35 times the natural logarithm of the odds ratio is at least 1.5 (Dorans & 
Holland, 1993). We used this criterion. Statistical tests were conducted at three levels of 
significance: α = .01, .05, and .10. 
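
As a rough illustration of the effect-size criterion described above, the following Python sketch computes the Mantel-Haenszel common odds ratio, the continuity-corrected MH chi-square, and the quantity 2.35 × ln(odds ratio) from a set of 2 × 2 tables, one table per level of the matching score. It is not the code used in this study. The function names, the tuple layout, and the assignment of the two gender groups to the rows of each table are assumptions made only for the example.

```python
import math


def mantel_haenszel(tables):
    """Pool 2 x 2 tables across matching-score levels.

    Each table is a tuple (a, b, c, d) for one score level:
    a, b = group 1 agree / disagree; c, d = group 2 agree / disagree.
    Which gender is "group 1" is an assumption left to the caller.
    Returns (odds_ratio, chi_square, p_value).
    """
    num = den = sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a level with fewer than two respondents carries no information
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if den == 0 or sum_var == 0:
        raise ValueError("tables do not support an MH estimate")
    odds_ratio = num / den
    # Continuity-corrected MH chi-square with 1 degree of freedom.
    chi_square = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    # Upper-tail probability of a chi-square variate with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return odds_ratio, chi_square, p_value


def is_large_dif(odds_ratio, p_value, alpha=0.05):
    """Large effect: significant test and |2.35 * ln(odds ratio)| >= 1.5."""
    return p_value < alpha and abs(2.35 * math.log(odds_ratio)) >= 1.5


# Hypothetical counts for two score levels, invented for illustration.
example_tables = [(10, 10, 5, 15), (15, 5, 10, 10)]
odds_ratio, chi_square, p_value = mantel_haenszel(example_tables)
print(round(odds_ratio, 2), round(chi_square, 2), is_large_dif(odds_ratio, p_value))
```

In practice the matching score would have many more levels than the two shown here, and sparse levels may need to be pooled before the tables are formed.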

In those instances where DIF is present, it may be reasonable to purify the criterion or 
total score used for matching by removing the items with DIF and rerunning the analyses of all 
items using the purified criterion. A recent study by Gierl, Jodoin, & Ackerman (2000) found 
that "...Type I error rates also remained close to the nominal alpha level of .05 with samples of 
250 examinees per group for MH, SIBTEST, and LR despite the large amount of DIF — up to 
60% — in the conditioning variable." (p. 18) with no purification. This study used simulated data 
with characteristics not unlike those encountered in actual data in the current study. Specifically, 
Gierl et al. (2000) created data in which 20%-60% of the items on a 40-item scale had moderate 
to large uniform DIF and the direction of the DIF was balanced with half the items favoring the 
focal group and half favoring the reference group. 

Data 

This research was heavily dependent on data available on the Internet and through 
personal communications with researchers. Data had to meet the following standards for 
inclusion. First, there must have been a scale consisting of multiple items measuring an attitude. 
Second, gender must have been reported. Third, data must have been available at the item level. 
Fourth, any data set with 25% or more missing values was excluded; mean-substitution was used 
to replace missing values where needed. Finally, the scale must have a passable reliability. 
Oosterhof (1994) states that a typical test should have a reliability of at least .60 to .80. 
Henderson, Morris, and Fitz-Gibbon (1987) indicate that reliability should be at least .50. The 
decision was made to use only scales having a minimum reliability of .70 assuming a total of 40 
items. This criterion was adjusted as necessary for scales with different numbers of items using 
the Spearman-Brown formula (Crocker & Algina, 1986). While more than 70 data sets were 
examined, only 23 met the standards for inclusion and were used in this study. These 23 data sets 
contained 54 acceptable scales with a total of 42,730 subjects responding to 982 items. A brief 
description of each data set is given in the appendix. Table 1 gives information at the scale level. 
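
The length adjustment of the reliability criterion can be sketched as follows. The sketch assumes the standard Spearman-Brown prophecy formula applied with a length factor of k = (number of items) / 40; the function name and default arguments are hypothetical and are not taken from the original analysis.

```python
def required_reliability(n_items, base_reliability=0.70, base_items=40):
    """Minimum reliability for a scale of n_items that is equivalent,
    under the Spearman-Brown formula, to base_reliability at base_items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    k = n_items / base_items  # length factor relative to the 40-item reference
    return k * base_reliability / (1 + (k - 1) * base_reliability)


# Thresholds for a few scale lengths that occur in Table 1.
for n in (9, 10, 20, 40):
    print(n, round(required_reliability(n), 2))
```

Under these assumptions a 10-item scale would need a reliability of about .37, which is consistent with the short scales in Table 1 that were retained with reliabilities near .39.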

<insert Table 1 about here> 

Both MH and LR procedures require dichotomous responses. This was also the only 
possible common response format across scales. When scales used more than two response 
categories, data were recoded to, say, agreement and disagreement. If there was a middle 
response category (e.g., neutral or undecided), the decision was to place these responses into the 
smaller of the two previously indicated categories to obtain more balanced responses. While 
these decisions were somewhat arbitrary and caused a small loss of information, they also 
provided a shared or common basis for subsequent comparisons. 
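
The recoding rule just described can be sketched in Python. The category codes, the function name, and the handling of ties (a tie sends the middle category to the disagreement side) are assumptions for illustration; the original study does not specify them.

```python
from collections import Counter


def dichotomize(responses, agree, disagree, middle=None):
    """Recode item responses to 1 (agreement) or 0 (disagreement).

    agree and disagree are sets of category codes. Responses equal to
    the optional middle code are placed in whichever of the two recoded
    categories is smaller, to balance the response split.
    """
    counts = Counter(responses)
    n_agree = sum(counts[c] for c in agree)
    n_disagree = sum(counts[c] for c in disagree)
    # Assumption: when the two categories are tied, the middle goes to disagreement.
    middle_value = 1 if n_agree < n_disagree else 0
    recoded = []
    for r in responses:
        if r in agree:
            recoded.append(1)
        elif r in disagree:
            recoded.append(0)
        elif middle is not None and r == middle:
            recoded.append(middle_value)
        else:
            raise ValueError(f"unexpected response code: {r!r}")
    return recoded


# Hypothetical 5-point item: 1-2 = disagree, 3 = neutral, 4-5 = agree.
item = [1, 2, 3, 3, 4, 5, 5, 5]
print(dichotomize(item, agree={4, 5}, disagree={1, 2}, middle=3))
```

Here agreement (four responses) outnumbers disagreement (two), so the two neutral responses are counted as disagreement, giving an even split.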

Results 

The percentage of statistically significant DIF items at α = .01 was calculated for the total 
number of the study items (982). The results were 18.0% for MH, 19.2% for uniform LR, and 
5.6% for non-uniform LR. At α = .05, the results were 29.1% for MH, 30.7% for uniform LR, 
and 11.9% for non-uniform LR. At α = .10, the results were 36.2% for MH, 38.3% for uniform LR, 
and 18.3% for non-uniform LR. Overall, a conservative estimate of the percentage of items 
that demonstrated uniform gender DIF was 15% at α = .01, 25% at α = .05, and 30% at α = .10. 
Percentages of DIF items for each scale at α = .05 are shown in Table 1. 

The DIF in these 54 scales was surprisingly balanced with respect to gender. Across all 
combinations of significance level and method, the percentage of uniform DIF items favoring 
males (males were more in agreement with the item than females) ranged from 47.5% to 51.9%. 
The corresponding range for the percentage of items favoring females was 48.1% to 52.5%. 
Overall, the mean proportion of items favoring males was 51.4% (α = .01), 48.6% (α = .05), and 
49.3% (α = .10). In only 10 of the 54 scales studied did the gender imbalance amount to more 
than one item. 

For items favored by males, the mean odds-ratios ranged from .58 to .66 across all 
combinations of method and significance level. This could be interpreted to mean that males 
were, on average, from 34% (1 - .66) to 42% (1 - .58) more likely to agree with these attitude 
items than were females. The results using LR were very close to those of MH. Using the metric 
described earlier, the large effect size would require an odds-ratio a bit more extreme, 
approximately .53 or less. That is, males would have to be at least 47% more likely than females to 
agree with items having a large effect. For items favored by females, the mean odds-ratios across 
method and significance level ranged from 1.84 to 2.97. These values are comparable to (the 
reciprocals of) the values for items favoring males. As with the items favoring males, we might 
characterize the magnitude of DIF, or the effect size of these statistically significant items, as 
medium to large. 
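
The cutoff of roughly .53 follows from the large-effect criterion given in the Method section. Writing the MH odds ratio as $\alpha_{MH}$ (notation introduced here only for this worked step), the arithmetic is:

```latex
\[
\left| 2.35 \ln \alpha_{MH} \right| \ge 1.5
\;\Longrightarrow\;
\left| \ln \alpha_{MH} \right| \ge \frac{1.5}{2.35} \approx 0.638
\;\Longrightarrow\;
\alpha_{MH} \le e^{-0.638} \approx 0.53
\quad\text{or}\quad
\alpha_{MH} \ge e^{0.638} \approx 1.89 .
\]
```

On this scale, the mean odds ratios of .58 to .66 for items favoring males fall somewhat short of the large-effect cutoff, while the range of 1.84 to 2.97 for items favoring females straddles it.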

The extent of agreement between methods (MH and LR) was estimated using Cohen's 
Kappa, κ, a measure of decision consistency that is adjusted for chance agreement (Crocker & 
Algina, 1986). With only uniform DIF, κ values were .88 at α = .01, .88 at α = .05, and .83 at 
α = .10. Simple (uncorrected for chance) proportions of agreement were .97, .95, and .92, 
respectively. When the non-uniform DIF was included, the agreement dropped as expected. 
Kappa values for all DIF items ranged from .80 to .66 while simple proportions of agreement 
ranged from .94 to .83. 
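
For readers who wish to reproduce this kind of agreement index, a minimal Python sketch of Cohen's kappa for two sets of flag/no-flag decisions is given below. The function name and the example decisions are hypothetical; the study's item-level decisions are not reproduced here.

```python
def cohens_kappa(flags_a, flags_b):
    """Return (observed agreement, kappa) for two lists of binary
    DIF decisions made on the same items by two methods."""
    if len(flags_a) != len(flags_b) or not flags_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(flags_a)
    observed = sum(1 for x, y in zip(flags_a, flags_b) if bool(x) == bool(y)) / n
    p_a = sum(1 for x in flags_a if x) / n  # proportion flagged by method A
    p_b = sum(1 for y in flags_b if y) / n  # proportion flagged by method B
    # Agreement expected by chance if the two methods flagged independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical decisions for ten items (1 = flagged as showing DIF).
mh_flags = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
lr_flags = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
agreement, kappa = cohens_kappa(mh_flags, lr_flags)
print(round(agreement, 2), round(kappa, 2))
```

The example shows why the simple proportion of agreement is always at least as large as kappa: part of the raw agreement would be expected by chance alone.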

Discussion 

It would appear that scales used in survey research may well have more than a few items 
that are functioning differentially by gender. If our selection of data sets and scales is relatively 
typical of attitude scales and we consider the results at α = .05, then the expected percentage of 
items showing statistically significant uniform DIF might well be near 25%. These results imply 
that survey researchers using attitude scales would be well advised to conduct DIF analyses in 
addition to the usual item analyses currently in common use. The prevalence of non-uniform DIF 
was substantially less than that of uniform DIF. 

Generally, and on average, differential functioning with respect to gender would appear 
to often be reasonably balanced between items favoring males and those favoring females. Of 
course, this is not meant to imply that in any one scale the number of items favoring one sex will 
be the same as the number favoring the other. Furthermore, even if the number of items showing 
statistically significant DIF is perfectly gender balanced, we cannot conclude that the combined 
influence of the items (effect size) is impartial. 

The magnitude of the DIF is not trivial. Using a classification scheme commonly used for 
cognitive items, we might conclude that a typical effect size for these identified items is in the 
range of medium to large. Clearly, several items of this sort favoring one gender over the other 
might well appreciably bias the assessment for many purposes. 

This study has a number of limitations. The scales studied were not randomly selected 
and may not be representative of any population of attitude scales. The number of studies and 
scales included was relatively small. Only two methods of DIF detection were used and 
information was undoubtedly lost by dichotomizing responses in many of the scales. 

The Mantel-Haenszel method of DIF detection and the logistic regression approach 
yielded very similar results with respect to uniform DIF. The choice between these methods 
would seem to be less than critical for the data we examined. Logistic regression would be the 
clear preference for the detection of non-uniform DIF, while the availability of an effect-size 
indicator (the odds-ratio) would argue for use of the MH procedure. Both MH and LR are 
currently implemented in many commonly used statistical packages. 

The potential for gender bias would seem quite real whether a survey is being used as 
part of a program evaluation, for marketing purposes, or for a multitude of possible research 
purposes. The question of whether or not attitude scales have similarly extensive DIF problems 
with respect to groups defined by ethnicity, age, socioeconomic status, or other characteristics 
remains open. 

Table 1 

Study Data Characteristics and DIF Results 

No  Name                                                                 N  Scale  Items  Alpha   MH¹   LR²   NU³
 1  Customer Satisfaction Survey                                       253      1     19    .84   11%   11%    0%
 2  Church and Community Project                                     5,123      1     10    .80   20%   20%   20%
                                                                              2     12    .81   33%   25%   17%
                                                                              3     25    .68   48%   48%    4%
 3  The National Survey of the Religious Life Future Project (ROOS)  6,359      1     10    .39   70%   70%   30%
 4  Multi-Investigator Survey                                        1,464      1     12    .68   25%   33%    8%
                                                                              2     14    .48   64%   64%   14%
 5  Family Survey                                                    1,024      1     30    .69   33%   33%    7%
 6  Religion and Politics Survey                                     1,975      1     22    .64   55%   59%    9%
 7  Race and Politics Survey                                         2,223      1     11    .43    0%    9%    9%
                                                                              2     14    .74    7%    7%    0%
                                                                              3     14    .62   14%   14%    0%
                                                                              4     20    .66   25%   10%   10%
 8  Catholic Pluralism Project                                       1,058      1      9    .84   33%   33%    0%
                                                                              2     15    .52    0%    0%   14%
                                                                              3      9    .64   22%   33%   11%
 9  The Four-State Church Involvement                                2,620      1      9    .39   11%   11%    0%
                                                                              2     14    .52   57%   57%   43%
10  General Social Survey (1983)                                     1,599      1     12    .58   25%   16%    8%
                                                                              2     13    .88   38%   38%    0%
                                                                              3     21    .66   24%   29%   14%
                                                                              4     14    .82   36%   43%   29%
                                                                              5     11    .44   27%   27%    9%
11  Economic Expectations and Attitudes                              1,421      1     12    .52   67%   67%    8%
                                                                              2     17    .53   29%   29%   12%
                                                                              3     10    .51   30%   30%   10%
12  Survey of Children Aged 12-14                                    1,374      1     23    .73   57%   57%    4%
13  Ministry with Young Adults Initiative                            2,879      1     44    .78   32%   34%   40%
                                                                              2     25    .73   36%   48%   12%
                                                                              3     26    .93   42%   42%   48%
14  Attitudes Toward Becoming Literate                                 538      1     45    .77   53%   62%   27%
15  Middletown Area Study                                              422      1     17    .57   12%   18%    0%
16  National Black Election Study                                    1,216      1     16    .58    6%    6%    6%
                                                                              2     15    .59   27%   27%   20%
                                                                              3     19    .64    5%   11%    5%
                                                                              4     15    .55   13%   13%    0%
                                                                              5     17    .57   29%   29%    0%
17  Canadian Election Study                                          3,949      1     21    .57   43%   38%   14%
                                                                              2     23    .75   52%   52%   13%
18  Attitudes Toward the Environment and Local Politics              1,005      1     16    .52   38%   44%   13%
                                                                              2     36    .73   47%   50%   17%
19  Social Inequality Survey                                         1,101      1     12    .48   25%   25%    0%
                                                                              2     13    .58    8%    8%   15%
                                                                              3     15    .54    0%    0%    7%
20  First-Year Undergraduate Involvement                               200      1     18    .78   11%   11%    0%
                                                                              2     10    .80    0%    0%    0%
                                                                              3     10    .91   10%   10%   10%
                                                                              4     27    .95    0%    0%    4%
21  French National Election Study                                   4,078      1     23    .59   56%   70%   26%
22  British General Election Study - Ethnic Minority Survey            705      1     34    .90    3%   12%    0%
                                                                              2     22    .80   14%    9%    5%
                                                                              3     23    .89   48%   48%    4%
23  An Analysis of the Proposed Subtype: Hopelessness Depression       144      1     18    .92    0%    0%    0%
                                                                              2     20    .94    5%   10%    0%

¹ MH indicates the percentage of statistically significant uniform DIF items at α = .05 using the 
Mantel-Haenszel procedure. 

² LR indicates the percentage of statistically significant uniform DIF items at α = .05 using 
logistic regression. 

³ NU indicates the percentage of statistically significant non-uniform DIF items at α = .05 using 
logistic regression. 

Appendix 

1. Customer Satisfaction Survey 

The Customer Satisfaction Survey was conducted for a retail chain with five stores in the 
southeast of the USA (SPSS Technical Support, 1999). The sample size was 253 subjects. A 
scale of 19 items was used for this study. 

2. Church and Community Project 

This study was conducted in various cities and towns in Illinois and Indiana in 1987. The study 
aimed to learn about the beliefs and attitudes of members about the basic aspects of church life. 
A sample of 5,123 subjects participated in the study (American Religion Data Archive, 1998). 

3. The National Survey of the Religious Life Future Project (ROOS) 

The purpose of this project was to collect information about the beliefs, values and attitudes of 
members of religious orders. This information could be used as a database for the study of 
religious life on the individual, congregational, and social institution levels. The study was 
conducted in the early 1990s. The sample size of this study was 6,359 and the overall response rate 
was 77.4% (American Religion Data Archive, 1998). 

4. Multi-Investigator Survey 

This is a national random-digit telephone survey with 1,464 respondents supported by the 
National Science Foundation. The survey was conducted in 1994 by the Survey Research Center 
of the University of California. The population was all English-speaking adults residing in 
households with telephones (Survey Documentation and Analysis, no date). 

5. Family Survey 

This study was conducted in 1994 as a part of The International Social Survey Program (ISSP). 
The survey covered some important topics related to family and social life. The sample size was 
1,024 subjects (Sociological Data Archive, 1998). 

6. Religion and Politics Survey 

The Princeton Survey Research Associates conducted this 1996 nationwide survey by phone 
among 1,975 adults. The subjects were 18 years of age or older (American Religion Data 
Archive, 1998). 

7. Race and Politics Survey 

The National Race and Politics Survey was conducted by the Survey Research Center of the 
University of California in 1991. It was a nationwide random-digit telephone survey with a 
sample of 2,223 subjects. The population was all English-speaking adults 18 years of age or older, 
residing in households with telephones (Survey Documentation and Analysis, no date). 

8. Catholic Pluralism Project 

This telephone interview survey with 1,058 respondents was conducted in 1995 to provide social 
and religious information on American Catholicism (American Religion Data Archive, 1998). 

9. The Four-State Church Involvement 

This study of 2,620 subjects was conducted in fall, 1988, in four states: Ohio, Massachusetts, 
North Carolina, and California (American Religion Data Archive, 1998). 

10. General Social Survey (GSS) 

Face-to-face interviews were used in the GSS to collect the 1,599 subjects’ responses to the 
survey questions. GSS contains questions that measure American adult characteristics, attitudes 
and behaviors. A portion of the GSS data of 1983 was used in this study (Survey Documentation 
and Analysis, no date). 

11. Economic Expectations and Attitudes 

The Socioeconomic Team of the Institute of Sociology, Academy of Science Czech Republic, 
Prague conducted this study. Attitudes towards the central problems of economic transformation 
were assessed in this survey. The most important political issues, such as voting preference, trust 
in institutions, and attitudes towards the “strong hand” of the government, were also covered. The 
sample size was 1,421 subjects (Sociological Data Archive, 1998). 

12. Survey of Children Aged 12-14 

The Educational Testing Service designed and conducted this survey in 1993. The survey 
covered different areas such as the desired outcomes and the effectiveness of programs in 
fostering these outcomes. The sample size was 1,374 subjects (American Religion Data Archive, 
1998). 

13. Ministry with Young Adults Initiative 

This study was conducted by the National Conference of Catholic Bishops in 1995. Young adults 
are people in their late teens, twenties and thirties, may live in different places, and may be 
employed or in college. The sample size was 2,879 subjects (American Religion Data Archive, 
1998). 

14. Attitudes toward Becoming Literate Survey 

This study examined the relationship of gender, age and economic status on adult student 
attitudes toward becoming literate. A random sample of 538 adults was selected from the 
population of the Saudi Arabia Literacy and Adult Education Program (Boudy, 1999). 

15. Middletown Area Study 

This study has been conducted every year since 1978. It assesses citizens’ views on subjects such 
as life satisfaction, education, income, family, religion, and others. Data from the 1995 study 
conducted in Delaware County, Indiana with 422 subjects was used. 

16. National Black Election Study 

This study was conducted in 1996 to provide information about Black political preferences 
during the 1996 presidential election. The survey of 1,216 subjects contained questions on a 
diverse range of issues such as perception and evaluations of candidates, opinions on public 
policy, participation in political life, race, and economic matters. 

17. Canadian Election Study 

The population of this study was Canadian citizens 18 years of age or older who spoke English 
or French and lived in private homes in Canada. The survey covered different political attitudes 
and voting behavior and was conducted in 1997 with 3,949 subjects. Additional topics were 
government spending, social issues and values including abortion, unions, business, education, 
and health care. 

18. Attitudes toward the Environment and Local Politics 

This survey was conducted in the Czech Republic by the Research on Local and Regional 
Problematic, Institute of Sociology, Academy of Science Czech Republic in 1993. This survey 
was a part of the International Social Survey Program (ISSP). The sample size was 1,005 and the 
response rate was 72% (Sociological Data Archive, 1998). 

19. Social Inequality Survey 

This survey, conducted annually in many countries, was a part of the International Social Survey 
Program (ISSP). Data from the 1992 study in Czechoslovakia was used. The study covered a 
variety of social and economic topics with 1,101 subjects. 

20. First-Year Undergraduate Involvement 

This study is conducted annually by the Office of Institutional Research at Ohio University. The 
study assessed student involvement in activities related to their formal education. It also solicited 
information on students’ academic involvement, their social involvement and activities, and their 
personal goals and adjustment to college. Data from the 1994-1995 survey with 200 respondents 
was used in this study. 

21. French National Election Study 

This is a national survey with 4,078 subjects conducted to assess the attitudes and opinions of the 
French people regarding the election of 1995. Different topics were covered such as interest in 
politics, ideological leanings, voting behavior, and the effects of television on respondent's 
choice of presidential candidates. 

22. British General Election Study-Ethnic Minority Survey 

This study was conducted in 1997 on the eligible ethnic minorities in Britain. The respondents 
were Black, Indian, Pakistani, and Bangladeshi. The survey had several aims, such as to evaluate 
the voting of ethnic minorities, to examine whether their political attitudes differed from majority 
attitudes, and to explore whether these members were influenced by different considerations than 
the majority. The sample size was 705. 

23. An Analysis of the Proposed Subtype: Hopelessness Depression 

This study aimed to examine hopelessness and the prevalence of two types of unipolar 
depression. This study was conducted in 1999 on 144 college students in the Midwest. 